Abstract

We describe a method for fast approximation of sparse coding. The input
space is subdivided by a binary decision tree, and we simultaneously
learn a dictionary and an assignment of allowed dictionary elements for
each leaf of the tree. We store a lookup table with the assignments and
the pseudoinverses for each node, allowing for very fast inference. In
the process of describing this algorithm, we discuss the more general
problem of learning the groups in group structured sparse modelling. We
show that our method creates good sparse representations by using it in
the object recognition framework of [1, 2]. Using our own fast
implementation of the SIFT descriptor, the whole system runs at 20
frames per second on 321 x 481 images on a laptop with a quad-core CPU,
while sacrificing very little accuracy on the Caltech 101 and 15 scenes
benchmarks.



1 Introduction

Sparse modeling [3, 4] has proven to be a useful framework for signal
processing. Each point from a dataset consisting of vectors in a
Euclidean space is represented by a vector with only a few nonzero
coefficients. Sparse modeling has led to state of the art algorithms in
image denoising, inpainting, supervised learning, and of particular
interest here, object recognition. The systems described in
[1, 2, 5, 6, 7] use sparse coding as an integral element. Since the
coding is done densely in an image with relatively large dictionaries,
this is a computationally expensive part of the recognition system, and
a barrier to real time application. The main contribution of this paper
is a fast approximate algorithm for finding sparse representations; we
use this algorithm to build a system with near state of the art
recognition performance that runs in real time. During inference the
algorithm uses a tree to assign an input to a group of allowed
dictionary elements and then finds the corresponding coefficient values
using a cached pseudoinverse. We give an algorithm for learning the
tree, the dictionary and the dictionary element assignment, and along
the way discuss methods for the more general problem of learning the
groups in group structured sparse modelling.

One standard formulation of sparse coding is to consider $N$
$d$-dimensional real vectors $X = \{x_1, \dots, x_N\}$ and represent
them using $N$ $K$-dimensional real vectors $Z = \{z_1, \dots, z_N\}$
using a $K \times d$ dictionary matrix $W$ by solving

$\operatorname{argmin}_{Z,W} \sum_k \|W z_k - x_k\|^2, \quad \text{s.t. } \|z_k\|_0 \le q,$ (1)

where $\|\cdot\|_0$ measures the number of nonzero elements of a vector;
each input vector $x$ is thus represented as a vector $z$ with at most
$q$ nonzero coefficients. While this problem is not convex, and in fact
the problem in the $Z$ variable is NP-hard, there exist algorithms for
solving both the problem in $Z$ (e.g. Orthogonal Matching Pursuit,
OMP []) and the problem in both variables (e.g. K-SVD [4]) that work
well in many practical situations.

It is sometimes appropriate to enforce more structure on $Z$ than just
sparsity. For example, many authors have noted that the solution to the
$z$ minimization in (1) (and its $\ell_1$ relaxation) is very unstable
in the sense that nearby inputs can have very different coefficients, in
part because of the combinatorially large number of possible active sets
(i.e. sets of nonzero coordinates of $z$). This can be a problem in
classification tasks. Other times we may know in advance some structure
in the data that the coefficients should preserve. Various forms of
structured sparsity are explored in [8, 9, 10, 11].

A simple form of structured sparsity is given by specifying a list of
$L$ allowable active sets, and some function
$g : \mathbb{R}^d \to \{1, \dots, L\}$ associating each $x$ to one of
the $L$ configurations. An example of this is the output of many
subspace clustering algorithms. There, $X$ is reordered and partitioned
into $PX = [X_1\ X_2 \dots X_L]$ (where $P$ is a permutation matrix), so
that each block $X_j$ is near a low dimensional subspace spanned by
$B_j$. Supposing for simplicity that each of the $B_j$ is of the same
dimension $q$, then if we set $W = [B_1 \dots B_L]$, the allowable
active sets are given by $\{1, \dots, q\}$, $\{q + 1, \dots, 2q\}$, etc.
By setting the allowable active sets to the blocks, and the function $g$
to simply map each point to its nearest subspace (say in the standard
sense of Euclidean projections), we get an example of structured
sparsity as described above; this sort of method is used in object
recognition in [6].

In this work we will try to learn the $L$ configurations as well as the
dictionary. We introduce a Lloyd-like algorithm that alternates between
updating the dictionary, updating the assignments of each data point to
the groups, and updating the dictionary elements associated to a group
via simultaneous OMP [12].

At inference time, we need a fast method for determin- 
ing which group an x belongs to. This is computationally 
expensive if there is a large number of groups and one 
needs to check the projection onto each group. However, by
specializing the Lloyd type algorithm to the case when 
each group is composed of a union of (perhaps only one) 
leaves of a binary decision tree, we will build a fast infer- 
ence scheme into the learned dictionary. The key idea is 
that by using SOMP, we can learn which leaves should use 
which dictionary elements as we train the dictionary. To 
code an input, we march it down the tree until we arrive 
at the appropriate leaf. In addition to the decision vectors 
and thresholds, we will store a lookup table with the active 
set of each leaf as learned above, and the pseudoinverse of 
the columns of W corresponding to that active set. Thus 
after following x down the tree we need only make one 
matrix multiplication to get the coefficients. 

Finally, we would like to use these algorithms to build an accurate real
time recognition system. We focus on a particular architecture studied
in [1, 2, 6, 7]. First, SIFT descriptors are calculated densely over the
image. Then (a form of) sparse coding is used to calculate a sparse
vector at every location from the corresponding sift vector. Then each
feature is pooled over a small number of spatial regions and the results
are concatenated. Finally the labels are obtained using a linear SVM or
logistic regression.

We use this pipeline with two modifications. First, we write our own
fast implementation of the SIFT descriptor. Second, we use our fast
algorithm for the sparse coding step. The resulting system achieves
nearly the same performance as the exact sparse coding calculation but
processes 321 x 481 images at the rate of 20 frames per second on a
laptop computer with a quad core CPU.

The rest of this paper is organized as follows. In Section 2, we discuss
greedy structured sparse modeling, and describe in depth how to train a
model that learns the structure, and that respects a given set of groups
given by a tree. In Section 3, we show experiments on image patches to
qualitatively demonstrate what learned groups look like, and then we
apply our methods to object recognition.

2 Hashing and dictionary learning 

2.1 A simple form of structured dictionary 
learning 

Here we will first suppose that a list of $L$ perhaps overlapping groups
$G_1, \dots, G_L$ on the coefficients $Z$ is given. That is, if we are
learning a representation of $X$ with $K$ atoms, each
$G_i \subset \mathcal{P}(\{1, \dots, K\})$, where $\mathcal{P}$ is the
set of all subsets of its argument, is specified. We can generalize the
Lloyd algorithm for $K$-means or $K$-flats to this setting. After
initializing the dictionary $W$, we find the distance of each $x$ in $X$
to its projection $P_{G_i} x$ onto the span of $W_{G_i}$ for each $i$.
Each $x$ is associated to the $i$ with the smallest distance

$x \mapsto \operatorname{argmin}_{i \in \{1, \dots, L\}} \|P_{G_i} x - x\|^2,$ (2)

and we find the coefficients

$z = (W_{G_i}^T W_{G_i})^{-1} W_{G_i}^T x.$

Then we update $W$ to be the minimum of the convex problem

$\operatorname{argmin}_W \|WZ - X\|^2,$

and repeat. Each of the subproblems either has an explicit solution or
is convex, and so the energy decreases. When the training is finished,
we define $g$ to be the function that maps each point
$x \in \mathbb{R}^d$ to the $i$ minimizing the error of the projection
of $x$ onto the span of $W_{G_i}$.

Algorithm 1 SOMP [12]

function Z = SOMP(X, W, K)
    Initialize: coefficients Z = 0, residual R = X, active set Ω = ∅.
    repeat
        j = argmax_k Σ_s |W_k^T R_s|
        Ω = Ω ∪ {j}
        Z_Ω = (W_Ω^T W_Ω)^{-1} W_Ω^T X
        R = X - WZ
    until K iterations
end function
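The assignment step (2) and the least squares coefficient solve described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data and dictionary are stored with one vector per column (so `W` is held as a d x K array), each group is a plain list of atom indices, and the function name `assign_to_groups` is hypothetical rather than part of the system described in this paper.

```python
import numpy as np

def assign_to_groups(X, W, groups):
    """Assign each column of X to the group with the smallest projection error.

    X: (d, N) data, one vector per column.
    W: (d, K) dictionary, one atom per column (storage layout is an assumption).
    groups: list of index lists, each a candidate active set.
    Returns (best_group, Z) with Z of shape (K, N).
    """
    N = X.shape[1]
    K = W.shape[1]
    best_err = np.full(N, np.inf)
    best_group = np.zeros(N, dtype=int)
    Z = np.zeros((K, N))
    for i, G in enumerate(groups):
        G = np.asarray(G)
        WG = W[:, G]
        # Least squares coefficients of every x on the atoms of this group.
        coeffs, *_ = np.linalg.lstsq(WG, X, rcond=None)
        err = np.sum((WG @ coeffs - X) ** 2, axis=0)
        better = err < best_err
        best_err[better] = err[better]
        best_group[better] = i
        # Overwrite the coefficients of the points that switched group.
        Z[:, better] = 0.0
        Z[np.ix_(G, np.flatnonzero(better))] = coeffs[:, better]
    return best_group, Z
```

The exhaustive loop over groups is exactly the cost that the greedy and tree-based variants discussed later are meant to avoid.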

We can also run the same sort of algorithm when in 
addition to each group G specifying a list of indices, it 
also specifies a cost for the use of the dictionary elements 
associated to each index. If we choose an $\ell_2$ cost for each of the
coefficients, we still get explicit updates and the decrease of energy
at each round.

Note that if the number of groups is very large, it may be too costly to
find the best group for each $x$ exhaustively. However, we can make a
greedy approximation by running a modified OMP. Here, supposing at
iteration $s$ of the OMP we have an active set $\Omega_s$, the available
dictionary elements to add to $\Omega_s$ are the union of all groups
containing $\Omega_s$. It is not necessary to be able to enumerate all
the groups to use this method, only to have a subroutine which given
$\Omega \subset \{1, \dots, K\}$ can return
$\bigcup_{\Omega \subset G_i} G_i$. However, using this sort of greedy
approximation removes the guarantee that the energy decreases at each
iteration.

2.2 Learning the groups with simultaneous 
orthogonal matching pursuit 

In the previous section the groups were specified in ad- 
vance. If we want to learn the groups, we can add a step 
in the algorithm. Now instead of taking the list of groups 
as input, we instead input just the number K of dictionary 
elements and the number of coefficients allowed per x. 



After associating to each x the group that best represents 
it, we can turn around and consider all the x associated 
to that group. Our task is then to choose a subset of the 
dictionary that best represents that group. A greedy ap- 
proximation to this problem in the least squares sense is 
given by the Simultaneous Orthogonal Matching Pursuit 
algorithm (SOMP) [12]. This algorithm proceeds just as a standard OMP,
but at each iteration, all the $x$ associated to a given group have to
choose the next dictionary element added to the group together. See
Algorithm 1.
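A compact NumPy sketch of SOMP as described above may help make the shared atom selection concrete. It assumes column-wise storage (`X` is d x N, `W` is d x K) and scores each atom by the summed absolute correlation with all residuals; the function name `somp` and the argument `n_atoms` are illustrative choices, not taken from an existing library.

```python
import numpy as np

def somp(X, W, n_atoms):
    """Greedy simultaneous OMP: all columns of X share one active set.

    X: (d, N) data, W: (d, K) dictionary with atoms as columns (assumed layout).
    Returns (Z, active) where Z is (K, N) and active lists the chosen atoms.
    """
    K = W.shape[1]
    N = X.shape[1]
    active = []
    R = X.copy()
    Z = np.zeros((K, N))
    for _ in range(n_atoms):
        # Total absolute correlation of each atom with all residuals.
        scores = np.abs(W.T @ R).sum(axis=1)
        if active:
            scores[active] = -np.inf  # never pick the same atom twice
        active.append(int(np.argmax(scores)))
        # Refit all points on the shared active set by least squares.
        coeffs, *_ = np.linalg.lstsq(W[:, active], X, rcond=None)
        Z[:] = 0.0
        Z[active, :] = coeffs
        R = X - W @ Z
    return Z, active
```

In the group-learning loop, `X` would hold only the points currently assigned to one group, and `active` becomes that group's new set of dictionary elements.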

Unfortunately, because neither OMP nor SOMP is 
guaranteed to find the optimal solution to the NP hard 
problems they address, the energy may not decrease at 
each iteration with this scheme; however, as usual, we 
have found that in practice these methods do usually lead 
to a decrease in the energy. As in K-means, it may happen
that no group uses a dictionary element; in such a situa- 
tion one can remove a dictionary element from one of the 
groups, find the residual, and replace the unused dictio- 
nary element by the principal component of the residual. 

We note that the model presented here can be thought 
of as a greedy sparse coding version of a "topic model". 
The dictionary elements act as the words, the x as the 
documents, and the groups are the topics. The algorithm 
learns the topics and the dictionary simultaneously. 

2.3 Hashing, quantization, and dictionary 
learning 

The main focus of this work will be choosing a g that 
can be computed rapidly and learning a dictionary that respects $g$. We
will consider $g$ to be a hash function on $\mathbb{R}^d$, and hash
buckets will be the atomic units of the groups; that is, the groups will
either be the hash buckets or will be glued together from the hash
buckets. This can be considered a sort of geometric regularization of
the sparse coding problem: the active set will be forced to remain
constant on the region of $\mathbb{R}^d$ corresponding to each hash
bucket.

Once $g$ is chosen, we will learn the dictionary (and perhaps groups) as
above, but instead of allowing each $x$ to choose the group that best
represents it individually, the $x$ in a hash bucket will need to choose
the group that best represents them together on average. We will also
try to approximate standard greedy dictionary learning; in this case,
there will be one group for every hash bucket. As above, and as with
K-means, it may happen that no spatial bucket uses a particular group;
in that case we can just pick a bucket at random and use the output of
SOMP on that bucket to regenerate the unused group.

Algorithm 2 Learning a dictionary and groups

Require: data X, number of dictionary elements K, number of active
    coefficients per data point q, number of iterations I, and if
    desired, g : R^d -> {1, ..., M}.
repeat
    1: Each x chooses a group in {1, ..., L} via (2) or by the modified
       OMP as in Section 2.1. If g is given, all the x in a hash bucket
       are forced to choose the same group.
    2: Each group in {1, ..., L} chooses a subset of {1, ..., K} using
       Z = SOMP(X, W, k).
    3: Update W, either via K-SVD, or a least squares solve.
until I iterations

Learning how to quantize is a much studied (but still not completely
understood) problem. One common motivation is to build a data structure
allowing nearest neighbors from a given data set to be quickly computed.
Another common motivation is to use the buckets of the quantization as
words to build bag of words feature representations. The relationship
between vector quantization and sparse coding has been studied before by
many authors []. In particular, K-means is simply $\ell_0$ sparse coding
with only the coefficients 0 and 1 allowed, and only 1 nonzero per
$x$.¹

¹ "Shape gain coding" allows a non-binary coefficient.
² Although perhaps not exactly standard usage, we will call the data
structure obtained from binary partitions of $\mathbb{R}^d$ a hash.

In this work we will use a 2-means tree with subdivisions along medians
to define $g$. We start by taking the entire data set and running
2-means, obtaining centers $c_1$ and $c_2$. We take each data point
$x \in X$ and find the angle between $x$ and $c_1 - c_2$; $X$ is divided
at the median. We then repeat on each of the pieces, continuing until
each piece is within a given distance to its mean, or a set depth $p$,
whichever comes first. We initialize the 2-means with farthest
insertion, as in [13]. Note that our experience is that very few
iterations are necessary, and really the farthest insertion is
sufficient; in fact cutting in random directions (with some additional
tricks and randomizations) has been shown to lead to good partitions
when the underlying data has a "manifold" structure, see [14]. The
number of buckets at the bottom of the tree is upper bounded by $2^p$;
we will choose $p$ small enough so that it is simple to store a lookup
table with the indices into the dictionary for each bucket, as well as
the decision vectors for each branch in the tree.
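The tree construction above can be illustrated with the following sketch. It makes several simplifying assumptions that are not from the text: points are stored as rows, the split statistic is the inner product of $x$ with $c_1 - c_2$ (used here in place of the angle), the first 2-means center is simply the first point, and only the depth limit is used as a stopping rule (the distance-to-mean criterion is omitted). The names `split_direction`, `build_tree` and `bucket_of` are hypothetical.

```python
import numpy as np

def split_direction(P, n_iter=3):
    """A few 2-means iterations on the rows of P; returns c1 - c2."""
    c1 = P[0]                                          # arbitrary start (assumption)
    c2 = P[np.argmax(np.sum((P - c1) ** 2, axis=1))]   # farthest point from c1
    for _ in range(n_iter):
        near_c1 = (np.sum((P - c1) ** 2, axis=1)
                   <= np.sum((P - c2) ** 2, axis=1))
        if near_c1.all() or not near_c1.any():
            break
        c1, c2 = P[near_c1].mean(axis=0), P[~near_c1].mean(axis=0)
    return c1 - c2

def build_tree(P, depth, counter=None):
    """Recursively split the rows of P at the median along c1 - c2."""
    counter = [0] if counter is None else counter
    v = split_direction(P) if depth > 0 and len(P) > 1 else None
    if v is not None:
        proj = P @ v
        t = np.median(proj)
        left = proj <= t
    if v is None or left.all() or not left.any():
        leaf = {"bucket": counter[0]}   # leaves are numbered in creation order
        counter[0] += 1
        return leaf
    return {"v": v, "t": t,
            "left": build_tree(P[left], depth - 1, counter),
            "right": build_tree(P[~left], depth - 1, counter)}

def bucket_of(tree, x):
    """March x down the tree: one inner product and one comparison per level."""
    while "bucket" not in tree:
        tree = tree["left"] if x @ tree["v"] <= tree["t"] else tree["right"]
    return tree["bucket"]
```

With depth $p$ the number of leaves is at most $2^p$, so the bucket index can be used directly as a row index into the lookup table of active sets.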

We also could use mappings of the form $g(x) = s(h(Hx + b))$, where $H$
is a $p \times d$ matrix, $h$ is some sort of nonlinearity (e.g. tanh,
or sin), $b$ is an offset, and $s$ is a thresholding function []. These
mappings require
less storage and are somewhat simpler to compute for the 
same bit depth, but on the data sets we work on, they have 
the disadvantage that many of the buckets are often empty 
or have very few entries for reasonable p. While this 
can be remedied by simply gluing (nearly) empty buck- 
ets to nearby full buckets and updating the lookup table, 
we have found the trees to work better. Note also that un- 
like in nearest neighbor data structures, it is unnecessary 
for leaf nodes to keep track of spatially nearby leaves that 
are far away in the tree metric, because all we care about 
is which dictionary atoms are used at that node. 

After building $g$ and training the dictionary, in order to compute the
coefficients of a new data point $x$, we pass it through the tree,
obtaining $g(x)$. We look up $g(x)$ in a table, and this gives an index
set $\Omega$ of $m$ columns of $W$; at this point we solve the linear
system $W_\Omega z = x$ (in the least squares sense) to get the outputs.
Alternatively, for each group, we can store
$(W_\Omega^T W_\Omega)^{-1}$ (or some stable factorization), or
$(W_\Omega^T W_\Omega)^{-1} W_\Omega^T$, and just do the requisite
matrix multiplications.
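The lookup-table inference step can be sketched as follows, assuming the bucket index $g(x)$ has already been computed by the tree and that `W` is stored with atoms as columns. The helper names `build_lookup` and `encode` are illustrative only; `np.linalg.pinv` stands in for whichever cached factorization is actually stored.

```python
import numpy as np

def build_lookup(W, active_sets):
    """Cache, for every bucket, its active set and the pseudoinverse of W_Omega."""
    return [(np.asarray(S), np.linalg.pinv(W[:, S])) for S in active_sets]

def encode(x, bucket_id, lookup, K):
    """Sparse code of x: a single small matrix-vector product per input."""
    idx, pinv = lookup[bucket_id]
    z = np.zeros(K)
    z[idx] = pinv @ x   # (m, d) times (d,) gives the m active coefficients
    return z
```

The cost per input is independent of the dictionary size $K$ apart from writing the sparse output, which is why the scheme does not slow down as the dictionary grows.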

2.4 Discussion of related work 

The idea of clustering the input space and then using a 
different dictionary for each cluster has appeared several 
times before. As mentioned in the introduction, a simple 
example is the K-flats algorithm, or other subspace clustering
algorithms [15]. There, the subdictionaries serve the dual purpose of
determining the clusters and also finding the coefficients for the data
points associated to them. More recently this technique has been
successfully applied to object recognition by [6, 7]. In those works,
the clusters are determined by K-means (or a Gaussian mixture model); in
the first, there is a different dictionary for each cluster, and the
code is the size of the union of all the subdictionaries, but only the
blocks corresponding to the centroids near the input are nonzero. In the
second work, the dictionaries for each centroid are the same, but the
code is still a concatenation of the codes associated to each centroid
(and are set to zero if the input does not belong to that centroid). The
current work differs from these in two ways. The first is the use of a
fast method for clustering, and the second is the use of shared parts
across the dictionaries, where the organization of the parts sharing has
been learned from the data.

In [16] the authors construct a dictionary on the backbone of a
hierarchical clustering with fast evaluation.
They also use shared parts. However, in that work the part 
sharing is determined by the tree structure of the cluster- 
ing, and not learned. 

There is now a large literature on structured sparsity. 
Like this work, [11, 17] use a greedy approach for structured sparse
coding based on OMP or CoSaMP. Unlike this work, they have provable
recovery properties when the true coefficients respect the structure,
and when the dictionaries satisfy certain incoherence properties. On the
other hand, those works do not attempt to learn the dictionary, and only
discuss the forward problem of finding $z$ from $x$ and $W$. The works
in [8, 9, 12] use an approach to structured sparsity that allows for
convex optimization in $z$. In these works the coefficients are arranged
into a predetermined set of groups, and the sparsity term penalizes the
number of active groups, rather than the number of active elements; the
dictionary is trained to fit the data. None of these works attempt to
learn the group structure along with the dictionary.

Finally we note that other works have explored the idea 
of accelerating sparse coding by training the dictionary 
along with an approximation method, e.g. [5, 18]. In the
first, the approximation is via a single layer feed forward 
network, and in the second, via a multilayer feed forward 
network with a shrinkage nonlinearity. This work uses a 
tree and lookup table instead. 



3 Experiments 

3.1 What do the groups look like?

To get a sense of what kind of groups learned from Algorithm 2 look
like, we train a dictionary on 500,000 8 x 8 image patches, and view the
results. The image patches are drawn from the PASCAL dataset, and their
means are removed. We train a dictionary with 256 elements and 512
groups; each group has 5 dictionary elements in it. We train using the
batch method, with a K-SVD update for the dictionary.

After training, some of the dictionary elements are used 
by many groups, and others are used by only a few. The 
median number of groups using a given element is 6; 47 
elements are in exactly 1 group, and 15 are in more than 
30. In Figure 1 we display the dictionary ordered by the
number of groups containing each element; this number 
increases in each column and moving to the right. Unsur- 
prisingly, "popular" elements that belong to many groups 
are low frequency. In this figure we also show the groups 
containing a few chosen atoms. 

3.2 Review of the image classification 
pipeline 

Here we will review a standard pipeline for object recognition [1, 2],
while giving details about our implementation, which streamlines certain
components. It consists of the following parts: 1) calculation of sift
vectors at every location (sift grid), 2) calculation of the feature
vectors for every sift vector using the "tree sparse coding" described
above, 3) spatial pyramidal max pooling, and 4) logistic regression or
SVM classification. Care is taken to calculate each of these parts
efficiently.

3.2.1 Sift grid 

We run tests with two different implementations of dense sift. The first
is Matlab code by L. Lazebnik [1]. We also use a fast, approximate C++
version that we coded ourselves. The details are as follows:

The x and y derivatives. We convolve the image with two 5 x 5 filters
that are the x and y derivatives of a Gaussian. This results in the
values of the x and y derivatives $I_y = dI/dy$, $I_x = dI/dx$ of the
image intensity at every location of the image.

Figure 1: 256 dictionary atoms in 512 groups trained by Algorithm 2 on
500,000 8 x 8 image patches. The group structure and dictionary were
trained simultaneously. The dictionary elements, shown on top, are
ordered by popularity (the number of groups they belong to). Underneath,
for each dictionary atom in a colored square, we show all of the groups
containing it. These groups can be thought of as "topics". Less popular
atoms tend to be more specialized.

Orientation histogram. This operation takes the two gradient values
$I_y$, $I_x$ at every location and smoothly bins them into a histogram
of eight orientations $(0, \pi/4, \dots, 7\pi/4)$ as follows. First we
calculate the orientation angle
$\phi = \arctan(I_y/I_x) + \pi(1 - \mathrm{sign}(I_x))/2$ and magnitude
$m = \sqrt{I_x^2 + I_y^2}$. Let $\phi_h(n) = n\pi/4$,
$n = 0, \dots, 7$. The final set of values is
$v(n) = m \cdot \cos(\phi - \phi_h(n))_+$, where $(x)_+ = x$ if $x > 0$
and $0$ otherwise. Most of these operations are computationally
expensive and therefore we precompute these values. We bin the $I_y$ and
$I_x$ values into 500 bins each and for every combination ($500^2$
values) we calculate $v(n)$, $n = 0, \dots, 7$. The bin range is chosen
so that the values of $I_y$ and $I_x$ never fall outside the range of
the binning, so no checks are needed. After this computation we obtain 8
values at every location of the image.
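The precomputed orientation table can be sketched as below. This is a minimal illustration under assumptions of our own: `np.arctan2` is used for the quadrant-corrected angle, gradients are assumed to lie in a symmetric range `[-g_max, g_max]`, and the names `orientation_values`, `build_table` and `lookup` are hypothetical. The text uses 500 bins per gradient; the bin count is a parameter here.

```python
import numpy as np

def orientation_values(Ix, Iy):
    """v(n) = m * max(cos(phi - n*pi/4), 0) for n = 0..7, appended as last axis."""
    phi = np.arctan2(Iy, Ix)
    m = np.hypot(Ix, Iy)
    centers = np.arange(8) * np.pi / 4
    return m[..., None] * np.maximum(np.cos(phi[..., None] - centers), 0.0)

def build_table(n_bins, g_max):
    """Precompute v(n) for every pair of quantized gradients, indexed [iy, ix]."""
    g = np.linspace(-g_max, g_max, n_bins)
    Iy, Ix = np.meshgrid(g, g, indexing="ij")
    return orientation_values(Ix, Iy)

def lookup(table, Ix, Iy, g_max):
    """Replace the trigonometry by a table read; inputs are assumed in range."""
    n_bins = table.shape[0]
    def to_bin(a):
        return np.rint((a + g_max) / (2 * g_max) * (n_bins - 1)).astype(int)
    return table[to_bin(Iy), to_bin(Ix)]
```

At run time only the quantization and one table read per pixel remain, which is the point of the precomputation.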

Smooth subsampling. We subsample the resulting features by two in each
direction. Specifically let $v_{n,y,x}$ be the input value obtained from
the previous step, where $n$ is the feature number and $y, x$ is the
location. The output value will be
$u_{n,y,x} = v_{n,2y,2x} + v_{n,2y,2x+1} + v_{n,2y+1,2x} + v_{n,2y+1,2x+1}$.
This is efficient since it only involves additions. Note that it results
in output values that are essentially four times larger than the input
values at each location.

Smoothing. We convolve each feature with a $[[1, 1], [1, 1]]$ filter.
This is calculated using
$u_{n,y,x} = v_{n,y,x} + v_{n,y,x+1} + v_{n,y+1,x} + v_{n,y+1,x+1}$,
again resulting in output values essentially four times larger than the
input values.
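Both additive steps translate directly into array slicing. The sketch below assumes the features are stored as an array of shape (n, H, W) and drops a trailing odd row or column before subsampling; the function names are illustrative.

```python
import numpy as np

def subsample_sum(v):
    """u[n, y, x] = sum of the 2 x 2 block of v starting at (2y, 2x)."""
    _, H, Wd = v.shape
    v = v[:, : H - H % 2, : Wd - Wd % 2]   # drop an odd trailing row/column
    return (v[:, 0::2, 0::2] + v[:, 0::2, 1::2]
            + v[:, 1::2, 0::2] + v[:, 1::2, 1::2])

def smooth_sum(v):
    """u[n, y, x] = v[n,y,x] + v[n,y,x+1] + v[n,y+1,x] + v[n,y+1,x+1]."""
    return v[:, :-1, :-1] + v[:, :-1, 1:] + v[:, 1:, :-1] + v[:, 1:, 1:]
```

On a constant input each step multiplies the values by four, matching the scaling noted above.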

Combining and normalizing into sift vector. Now we obtain a 128
component sift vector from every location of the feature maps from the
previous step. At every location $(x, y)$ (of the subsampled feature
image) we first obtain a 128 component vector by concatenating the
8-component vectors at the following locations: $(x + 2i, y + 2j)$,
$i = 1, 2, 3, 4$ and $j = 1, 2, 3, 4$. Then we normalize this vector as
follows. If the norm of the vector is smaller than the threshold
$t_h = 1$ we keep the vector. If it is larger we normalize it to have
size $t_h$. The result is placed into the appropriate location of the
final $m_y \times m_x \times 128$ vector, where
$m_{y,x} \approx n_{y,x}/2$ and $n_{x,y}$ are the dimensions of the
original image. The dimensions are slightly smaller due to boundary
effects. This last operation (combining and normalizing) is the most
expensive operation in the sift grid calculation and we took care to
implement it efficiently. Note that in Lazebnik's (and Lowe's original)
sift the smoothing is done over a larger neighborhood with inputs near
the center weighted more than those further away. This makes the output
more smoothly varying under translations; in our case we used equal
weighting over small neighborhoods for computational efficiency.

3.2.2 Hashed sparse coding. 

We used the main procedure of this paper to calculate a feature vector
for each sift vector. Each such computation consisted essentially of
depth = 16 inner products of the sift vector with tree decision vectors
(16 x 128 multiplications) followed by multiplication of the sift vector
by the appropriate pseudo-inverse matrix (typically 128 x 5
multiplications), resulting in a total of approximately 128 x 21
multiplications. For a 2048 dimensional feature vector this compares to
128 x (2048 + 4) multiplications that are needed for OMP, resulting in
an almost 100-fold reduction. Our model was trained on 2 x 10^ randomly
selected sift vectors from the Pascal 2011 dataset.
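The multiplication counts quoted above can be checked with a few lines of arithmetic. The variable names are ours; the numbers (128-dimensional sift vectors, depth 16, 5 nonzeros, 2048 atoms, and the OMP count of 128 x (2048 + 4)) are taken from the text.

```python
d = 128        # sift vector dimension
depth = 16     # tree depth: one inner product per level
m = 5          # nonzero coefficients per input
K = 2048       # dictionary size

tree_cost = d * depth + d * m   # tree descent plus pseudoinverse multiply
omp_cost = d * (K + 4)          # count quoted in the text for OMP
ratio = omp_cost / tree_cost

print(tree_cost, omp_cost, round(ratio, 1))
```

The ratio comes out just under 98, consistent with the "almost 100-fold" figure.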

3.2.3 Spatial pyramidal pooling. 

We used the same spatial pyramidal max pooling as in [Y-Lan]. Since the
feature vectors are in the sparse format the resulting computation is
very efficient and negligible compared to either sift or tree sparse
coding. The details are as follows. We need to calculate the maximum
over the features in 1 x 1, 2 x 2 and 4 x 4 regions of the feature
vector obtained in the previous step. First we split this vector into
4 x 4 regions $R_{x',y'}$. Let $n_f$ be the number of features,
typically 2048, $v_{f,x,y}$ be the input feature vector and
$u_{f,x',y'}$, $x' = 1, 2, 3, 4$, $y' = 1, 2, 3, 4$ be the 4 x 4 part of
the final feature vector. We calculate $u$ using the following.

$u_{f,x',y'} = \max_{(x,y) \in R_{x',y'}} v_{f,x,y}$ (3)

This calculation is done by looping over all feature vectors and indices
and filling the pooled feature vector, so the number of computations is
of the order of the total number of nonzero features. We can get the
2 x 2 and 1 x 1 parts of the final feature vector analogously. However
it is more efficient now to use the 4 x 4 vector obtained and pool it
into 2 x 2 regions and then pool the result into 1 x 1 regions. The
final output vector is the concatenation of these vectors, resulting in
an $n_f \times 21$ vector.
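The coarse-from-fine pooling described above can be sketched on a dense array as follows. This is an illustration only: the system described here works on sparse feature vectors, whereas the sketch assumes a dense (H, W, n_f) array with H and W at least 4, splits the grid into 4 x 4 regions with `np.linspace`, and uses a function name of our own.

```python
import numpy as np

def pyramid_max_pool(F):
    """Max pool F of shape (H, W, n_f) over 4x4, 2x2 and 1x1 regions.

    Returns a vector of length 21 * n_f ordered as [1x1, 2x2, 4x4].
    """
    H, Wd, nf = F.shape
    ys = np.linspace(0, H, 5).astype(int)
    xs = np.linspace(0, Wd, 5).astype(int)
    p4 = np.empty((4, 4, nf))
    for i in range(4):
        for j in range(4):
            p4[i, j] = F[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(0, 1))
    # Reuse the 4x4 result: each 2x2 cell is the max of four 4x4 cells.
    p2 = p4.reshape(2, 2, 2, 2, nf).max(axis=(1, 3))
    p1 = p2.max(axis=(0, 1))
    return np.concatenate([p1.ravel(), p2.ravel(), p4.ravel()])
```

Only the 4 x 4 level touches the full feature map; the two coarser levels operate on 16 and 4 vectors respectively.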

3.2.4 Classification. 

Subsequently a logistic regression classifier is trained on the feature
vectors using the liblinear package [19].

3.2.5 Implementation. 

We implemented each of the following operations using multicore
processing: all steps of the sift, finding the group using the tree, and
multiplying by pseudo-inverses. In each of these steps separately the
image/feature image was split into ncores parts and each part was sent
to a different core. The system was implemented in C++. BLAS in the
Accelerate framework was used in the tree sparse coding. We report the
results on a MacBook Pro, with a 2.3 GHz Intel Core i7 processor with 4
cores. The observed speedup compared to a single core was about 3.

We also test the run time of just the coding, compared with coding using
OMP with the SPAMS package [20].

3.3 Accuracy on Caltech 101 and 15 scenes 

We test the accuracy of the standard pipeline with the 
hashed dictionary and with standard $\ell_0$ sparse coding on two object
recognition benchmarks, Caltech 101 [] and 15 scenes []. As mentioned
before, for all data sets, we train the hashed dictionary on 2 x 10^
randomly selected sift vectors from the Pascal 2011 dataset. Caltech 101
consists of 101 image categories and approximately 50 images per
category; many classes have more training examples and we do the usual
normalization of error by class size. We use 30 training examples per
class. The 15 scenes database contains 15 categories and 4485 images,
with between 200 and 400 images per category. We use 100 training images
per class on this data set. For each data set, we run over 10 random
splits and record the mean and standard deviation of the test error. We
record the results in Tables 1 and 2. The first two columns of each
table correspond to the hashed sparse coding run with 5 or 10 nonzero
entries on Lazebnik's sift. The next two columns correspond to the "real
time" system, hashed sparse coding run on our approximate sift, and the
last two columns correspond to OMP, trained and coded with SPAMS [20] on
Lazebnik's sift. Each row corresponds to the number of atoms in the
dictionary. As far as we know, state of the art with single features on
grayscale images on Caltech 101 with 30 training examples per category
is .773, in [7], and .898 for the 15 scenes, in [21]. Both of these
methods use the same basic pipeline as this work, but with variations on
the sparse coding; our method can be used in conjunction with their
methods.

As has been observed by other authors, increasing the 
size of the dictionary only seems to increase the accuracy. 
Note that for our method, the only places that the size of 
the dictionary affects the computational cost is in train- 
ing, where we use an SOMP, and in the final classification 
stage. The last component is small for these experiments, 
but if we wanted to use the system for detection at many
locations in an image, it would start to be significant.

3.4 Running speed. 

We tested the speed of the full pipeline from image to 
classification. We show results on images from the Berke- 
ley dataset and Caltech 101. The Berkeley images are 
321 x 481. The Caltech 101 images were resized so that the largest side
was at most 300, with the aspect ratio fixed. With 5 nonzero
coefficients and a depth 16 tree, we get the results in Table 3. The
entire dataset of 9145 images in Caltech 101 was processed in 4 minutes
and 48 seconds with 2048 features and in 5 minutes and 35 seconds with
8092 features. This corresponds to 31.75 fps and 27.3 fps respectively.

We also test the speed of just the sparse coding. Coding 15000 sift
vectors with a depth 16 tree and 5 nonzeros per $x$ takes .034 seconds
with one core, and .018 with four.³ In comparison, SPAMS with a
dictionary of size 1024 costs .898 seconds using four cores. This is not
exactly a fair test, as SPAMS must calculate a Cholesky decomposition of
the Gram matrix of the dictionary when it runs, and this could be
cached; however, simply multiplying the dictionary matrix by the data
vectors takes .294 seconds. As the size of the dictionary increases,
this will increase, but our method will not get any slower.

³ This test was done on a quad core Intel i5 running 64 bit Linux, with
4 gigs of RAM; both our code and SPAMS were run as a mex file through
Matlab.

4 Conclusion 

In this paper we presented a fast approximate sparse coding algorithm
and used it to build an accurate real time object recognition system.
Our contributions can be summarized in four parts. 1) We describe a
general method for learning the groups for greedy structured sparse
coding using a generalization of Lloyd's algorithm and SOMP. 2) We use
this method to design a fast approximation of greedy sparse coding that
uses a tree structure for inference. 3) We give a fast approximate
implementation of the SIFT descriptor. 4) These algorithms together
allow us to build a real time object recognition system in the framework
of [1, 2]. It processes the entire Caltech 101 dataset in under 5
minutes (with images resized so that the larger side is 300 pixels). As
far as we know this is the first time that a fast implementation of this
type of system has been put together with comparable accuracy.

We see many possible directions in the future both for 
improving the group sparse coding algorithm and apply- 
ing our system to vision. We would like to learn the hash 
or tree, rather than build it before the dictionary training. 
We would like to train the system on larger datasets and 
work on real time object detection (as opposed to classification). At
this speed the algorithm allows us to process around 2 million medium
sized images (300 x 400) in 24 hours on a single computer. Object
detection should also be feasible given that the expensive part, the
calculation of features at different parts of the image from which the
detection is calculated, is fast.
            hashed        hashed        hashed m = 5,  hashed m = 10,  OMP           OMP
            m = 5         m = 10        R.T.           R.T.            m = 5         m = 10
K = 1024    …             …             …              .697 ± .010     .725 ± .008   .721 ± .010
K = 2048    .735 ± .007   .731 ± .011   .723 ± .007    .716 ± .005     .747 ± .008   .738 ± .008
K = 4096    .741 ± .011   .740 ± .006   .736 ± .005    .724 ± .004     .754 ± .008   .757 ± .010
K = 8092    .751 ± .009                 .739 ± .003

Table 1: Caltech 101 accuracies and standard deviations over 10 random
splits. The first two columns of each table correspond to the hashed
sparse coding run with 5 or 10 nonzero entries, on Lazebnik's sift. The
next two columns correspond to the "real time" system, hashed sparse
coding run on our approximate sift, and the last two columns correspond
to OMP, trained and coded with SPAMS [20] on Lazebnik's sift. Each row
corresponds to the number of atoms in the dictionary.





Columns: hashed m = 5 | hashed m = 10 | hashed m = 5, R.T. | hashed m = 10, R.T. | OMP m = 5 | OMP m = 10

…           …   .792 ± .006   …   .786 ± .004   …
K = 2048    …   .826 ± .007   .822 ± .007   …

Table 2: 15 scenes accuracies and standard deviations over 10 random
splits. The first two columns of each table correspond to the hashed
sparse coding run with 5 or 10 nonzero entries on Lazebnik's sift. The
next two columns correspond to the "real time" system, hashed sparse
coding run on our approximate sift, and the last two columns correspond
to OMP, trained and coded with SPAMS [20] on Lazebnik's sift. Each row
corresponds to the number of atoms in the dictionary.





                       321 x 481 pixel images                                  Caltech 101 (on 4 cores)
                       1 core (s)  4 cores (s)  1 core (fps)  4 cores (fps)   total time (m:s)  (fps)  performance
SIFT                   0.039       0.017        25            59
SIFT+TreeSC+pyramid    0.143       0.045        7             22.5
full (1024)            0.145       0.0465       6.9           21              4:01              38     .710 ± .007
full (2048)            0.1473      0.050        6.8           20              4:45              32     .723 ± .007
full (4096)            0.1495      0.052        6.7           19              4:42              32     .736 ± .005
full (8092)            0.155       0.0565       6.4           18              5:35              27     .739 ± .003

Table 3: Speeds of different parts of the system and different
dictionary sizes on 321 x 481 pixel Berkeley dataset images and Caltech
101 images. The times are for a single frame in seconds. Frame rates are
the inverses and are in frames per second. The total time is the time to
process the entire Caltech 101 dataset consisting of 9145 images
(minutes:seconds). The Caltech 101 images were pre-resized so that the
largest side is 300 pixels. The last column is the recognition
performance when trained on 30 training images per category. (The speeds
vary, probably due to disk access, and are faster after one or more
sweeps through the dataset.)